P1

(a)

  P.triple <- 10 / (15 + 25 + 10)
  P.milk <- 25 / (15 + 25 + 10)
  P.triple.or.milk <- P.triple + P.milk

The probability of Yoyo getting a Triple Chocolate M&M or a Milk Chocolate M&M is 0.7.

(b)

  P.first.getting.triple <- 10 / (15 + 25 + 10)
  P.then.not.getting.dark <- (49 - 15) / 49
  P.first.and.then <- P.first.getting.triple * P.then.not.getting.dark

The probability of Yoyo first getting a Triple Chocolate M&M (she will eat it) and Lily then not getting a Dark Chocolate M&M from the bag is 0.1387755.

(c)

  P.two.triple.rep.small <- 1/2 * 1/2

The probability of sampling two Triple Chocolate M&Ms in a row when sampling with replacement is 0.25.

  P.two.triple.norep.small <- 1/2 * 9/19

The probability when sampling without replacement is 0.2368421.
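The without-replacement value can also be checked by counting: out of the `choose(20, 2)` equally likely pairs in the small bag, `choose(10, 2)` consist of two Triples. A quick sketch:

```r
## Combinatorial check of the without-replacement probability:
## ways to pick 2 of the 10 Triples over ways to pick 2 of all 20 M&Ms
p.counting   <- choose(10, 2) / choose(20, 2)
p.sequential <- 1/2 * 9/19

p.counting   ## 0.2368421
p.sequential ## 0.2368421
```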

(d)

  P.two.triple.rep.big <- 1/2 * 1/2

The probability of sampling two Triple Chocolate M&Ms in a row when sampling with replacement is 0.25.

  P.two.triple.norep.big <- 1/2 * 999/1999

The probability when sampling without replacement is 0.2498749.

## construct empty vectors
library(plotly) ## load plotly for plot_ly() below
P.rep <- c()
P.norep <- c()
sample.size <- c()

increment <- (2000 - 20) / 20

## calculate each probability with a different sample size
for (spl.size in seq(20, 2000, increment)) {
  sample.size <- append(sample.size, spl.size)
  ## with replacement, every draw has the same probability,
  ## so we simply square it for choosing twice
  P.rep <- append(P.rep, ((spl.size / 2) / spl.size)^2)
  P.norep <- append(P.norep, (spl.size / 2) / spl.size *
                      (spl.size / 2 - 1) / (spl.size - 1))
}

plot_data <- data.frame(sample.size, P.rep, P.norep)

## Modify x and y labels
x <- list(title = "Total Number of M&Ms")
y <- list(title = "Probabilities")

plot_ly(plot_data, x = ~sample.size, y = ~P.rep, name = 'with_replacement', type = 'scatter', mode = 'lines') %>%
  add_trace(y = ~P.norep, name = 'without_replacement', mode = 'lines') %>%
  layout(xaxis = x, yaxis = y)

As the sample size gets larger and larger, the probability of sampling without replacement gets closer to that of sampling with replacement, which is considered independent. Even so, sampling without replacement does, to some extent, affect later events, since each draw changes the composition of the bag. It does not meet the definition of independent events: two events are said to be independent of each other when the probability that one event occurs in no way affects the probability of the other event occurring.
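The convergence can be checked numerically. For a bag of n M&Ms, half of which are Triples, the without-replacement probability is (n/2)/n * (n/2 - 1)/(n - 1), which approaches 1/4 as n grows. A minimal sketch:

```r
## Without-replacement probability of two Triples for a bag of n M&Ms,
## half of which are Triples
p.two.triples.norep <- function(n) (n / 2) / n * (n / 2 - 1) / (n - 1)

p.two.triples.norep(20)   ## 0.2368421 (small bag, noticeably below 0.25)
p.two.triples.norep(2000) ## 0.2498749 (big bag, very close to 0.25)
```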

P2

(a)

x <- seq(500, 1500, length = 2000) ## get writing speeds ranging from 500 to 1500
density <- dnorm(x, 1000, 150)
plot(x, density, col="darkgreen",xlab="paper writing speed (words per hour)", ylab="Density", type="l",lwd=2, cex=2, main="Distribution of paper writing speed (wph) of students at ULE", cex.axis=.8)

(b)

z.score <- (1068 - 1000) / 150

The Z-Score for Yoyo’s writing speed is 0.4533333, which tells us that her writing speed is 0.4533333 standard deviations above the mean. From the Z table, the upper-tail probability associated with this Z-Score is 0.325166, so Yoyo’s writing speed is faster than 1 - 0.325166 = 67.4834% of the population.
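The same percentile can be obtained directly in R with pnorm(), without a Z table:

```r
## Fraction of the population slower than Yoyo (mean 1000, sd 150)
pnorm(1068, mean = 1000, sd = 150) ## ~0.6748
## Equivalently, from the Z-score itself
pnorm((1068 - 1000) / 150)         ## ~0.6748
```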

(c)

  • First, we define the null hypothesis to be that the use of M&Ms has no effect on the students’ paper writing speed, and the alternative hypothesis to be that the use of M&Ms affects the students’ paper writing speed.

  • Second, we set the significance level alpha to 0.05.

  • Third, we load the data, put the ‘present’ and ‘absent’ observations into two vectors, and run t.test() to get the test statistics.

paperNinja <- read.csv("./paperNinja.csv", header=TRUE, sep=",")
MnM.present <- c()
MnM.absent <- c()

for (i in 1:nrow(paperNinja)) {
    row <- paperNinja[i, ]
    if (row$MMs == 0) { ## if the MMs column is 0, put into the MnM.present vector
      MnM.present[[row$ID]] <- row$WPH
    } else { ## otherwise, put into the MnM.absent vector
      MnM.absent[[row$ID]] <- row$WPH
    }
}

t.test(MnM.present, MnM.absent, conf.level = 0.95, paired = TRUE)
## 
##  Paired t-test
## 
## data:  MnM.present and MnM.absent
## t = 0.93915, df = 19, p-value = 0.3594
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -20.17415  53.01415
## sample estimates:
## mean of the differences 
##                   16.42
  • In this paired, two-sided t-test, the p-value is 0.3594, which is higher than 0.05, so we cannot reject the null hypothesis that the use of M&Ms has no effect on the students’ paper writing speed. Thus, we conclude that, statistically, the use of M&Ms has no effect on the students’ paper writing speed.

P3

(a)

I like the EDA that explores what was the worst day to fly out of NYC in 2013 if you dislike delayed flights, since the question can potentially reveal something about that particular day. It might be related to some events, extreme weather conditions, or social/political reasons. Besides, if we know the specific day and find certain patterns about it, we can better prepare for it and prevent it from happening. As customers, we can also try to avoid booking flights on such days in the future.

(b)

As a follow-up to the last question, I want to explore more about what the worst days to fly out of NYC are and whether there are specific patterns we can reveal. It’s interesting to know that the worst day in 2013 was Sept 12, but by querying more of the worst days, we can probably get more insight into this topic.

# Standard libraries
library(nycflights13)
library(dplyr) ## needed for %>% and the verbs below


#loading the flights data
flights %>%  
  #Selecting only the columns required: month, day, arrival delay,
  #and departure delay
  select(month, day, arr_delay, dep_delay) %>% 
  
  #We will just focus on positive delays. This means that we DON'T consider
  #early arrivals and departures as 'Delays' as it is intuitively incorrect.
  filter(arr_delay >= 0, dep_delay >= 0) %>%
  
  #Grouping by Month & Day as we want other operations to find the mean delay
  #for each day of the  month in the year 2013.
  group_by(month, day) %>%
  
  #Now we summarize the entire delay into a single column (avg_delay) by adding 
  #up the mean of the departure & arrival delays for each day of the month. We 
  #should have ideally 365 rows after this operation
  summarise(avg_delay =  mean(arr_delay, na.rm = TRUE) + 
              mean(dep_delay, na.rm = TRUE)) %>%
  
  #We now ungroup the grouped data set as we want to now calculate the highest 
  #delay amongst all the 365 days rather than just in a particular month
  ungroup() %>%
  
  #Now we arrange the delays in descending order to get the highest delay in the  
  #first row of our data set.
  arrange(-avg_delay) %>%
  
  #Selecting the first few rows of the data set. Which shows the highest delay and the 
  #month and day corresponding to it.
  head(20)
## # A tibble: 20 x 3
##    month   day avg_delay
##    <int> <int>     <dbl>
##  1     9    12  228.8011
##  2     9     2  223.1130
##  3     7    10  220.1812
##  4     4    10  217.2589
##  5     7    22  209.8665
##  6     3     8  209.1633
##  7     5    23  197.5820
##  8     6    24  197.0659
##  9    12     5  194.9683
## 10     7    28  190.2995
## 11     7     7  180.3283
## 12     8    28  179.3462
## 13     4    19  175.7979
## 14     6    27  173.4719
## 15     6    30  172.2966
## 16     8     8  171.8386
## 17     6    28  170.9259
## 18     7     1  167.9630
## 19     6     2  166.1901
## 20     6    13  164.6831

Looking at the top 20 days with the most delays, we can hardly find a noticeable pattern in these dates. We could conclude that there’s no obvious association between the dates of a year themselves and the likelihood of flights being delayed on a given day.

P4

(a)

courseSuccess <- read.csv("./courseSuccess.csv", header=TRUE, sep=",")
str(courseSuccess) # get stats including dimensions and variable types of the data
## 'data.frame':    30 obs. of  6 variables:
##  $ Gender    : int  1 1 0 1 1 1 0 1 0 0 ...
##  $ WatchTV   : int  1 1 0 0 1 1 1 1 1 1 ...
##  $ GPA       : num  3 3.6 3.2 3.4 3.1 3 3.6 3.4 3.3 3.4 ...
##  $ HardWork  : int  3 4 2 3 4 1 4 2 2 3 ...
##  $ SleepHours: num  7.3 6.5 6 7 6.3 7.3 6.1 5.6 6.2 6.8 ...
##  $ PrevGPA   : num  3.3 3.5 0 3.6 3.1 3.6 3.6 3.7 0 3.6 ...
## clean data by making sure that there are no missing or duplicate values
courseSuccess <- unique(courseSuccess[complete.cases(courseSuccess), ])

(b)

hist(courseSuccess$GPA, breaks=20, col="grey", xlab="GPA", 
    main="Distribution of GPA", xlim = c(2,5), ylim = c(0,12)) 

The graph shows that GPAs generally range from 2.8 to 3.6, and the number of students decreases as GPA increases. However, there’s an outlier at a GPA of 4.9. Considering that we are collecting GPA on a 4.0 scale, this is likely a data error.
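A quick way to flag such values is to filter for GPAs above the scale maximum. A minimal sketch on an illustrative vector (the same check would apply to courseSuccess$GPA, assuming a 4.0 scale):

```r
## Illustrative GPA vector containing one impossible value on a 4.0 scale
gpa <- c(3.0, 3.6, 3.2, 3.4, 4.9)

which(gpa > 4.0) ## 5
gpa[gpa > 4.0]   ## 4.9
```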

(c)

linearMod <- lm(GPA ~ HardWork, data=courseSuccess)
print(linearMod)
## 
## Call:
## lm(formula = GPA ~ HardWork, data = courseSuccess)
## 
## Coefficients:
## (Intercept)     HardWork  
##      2.6539       0.2021
  • The linear model: GPA = 2.6539 + 0.2021 * HardWork.

  • The Beta0 of 2.6539 means that we would expect an average GPA of 2.6539 for a student in the class with no HardWork at all.
  • The Beta1 of 0.2021 means that if HardWork differed by one unit, GPA would differ by 0.2021 units, on average.
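For instance, plugging a HardWork score of 3 into the fitted equation gives the expected GPA for such a student:

```r
## Predicted GPA from the fitted simple regression at HardWork = 3
beta0 <- 2.6539
beta1 <- 0.2021
beta0 + beta1 * 3 ## 3.2602
```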

plot(courseSuccess$HardWork, courseSuccess$GPA, pch = 16, cex = 1.3, col = "blue", main = "GPA ~ HardWork", xlab = "HardWork", ylab = "GPA")
abline(linearMod) ## linear regression line (GPA on the y-axis to match the model)

p.value <- anova(linearMod)$'Pr(>F)'[1] ## get p-value

Based on the visualization and the p-value 0.0058007, which is smaller than 0.05, the correlation is statistically significant.
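The p-value can be pulled out of either anova() or summary(); for a simple regression with one predictor the two agree. A sketch on the built-in mtcars data (a stand-in here, not courseSuccess), just to show the extraction:

```r
## Extracting the regression p-value in two equivalent ways
## (mtcars is used only as illustrative stand-in data)
m <- lm(mpg ~ wt, data = mtcars)
p.anova   <- anova(m)$"Pr(>F)"[1]
p.summary <- summary(m)$coefficients["wt", "Pr(>|t|)"]
all.equal(p.anova, p.summary) ## TRUE
p.anova < 0.05                ## TRUE: significant at alpha = 0.05
```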

(d)

## get Multiple Linear Regression
fit <- lm(courseSuccess$GPA ~ Gender + WatchTV + HardWork + SleepHours + PrevGPA, data=courseSuccess)
summary(fit) # show results
## 
## Call:
## lm(formula = courseSuccess$GPA ~ Gender + WatchTV + HardWork + 
##     SleepHours + PrevGPA, data = courseSuccess)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.65179 -0.16697  0.00364  0.09330  1.21813 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.30299    0.72700   5.919 4.16e-06 ***
## Gender      -0.18096    0.15134  -1.196  0.24347    
## WatchTV      0.05842    0.13276   0.440  0.66385    
## HardWork     0.20593    0.06775   3.039  0.00565 ** 
## SleepHours  -0.25697    0.10360  -2.480  0.02053 *  
## PrevGPA      0.04283    0.05496   0.779  0.44344    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3477 on 24 degrees of freedom
## Multiple R-squared:  0.4073, Adjusted R-squared:  0.2839 
## F-statistic: 3.299 on 5 and 24 DF,  p-value: 0.02083
  • The multiple regression model: GPA = 4.30299 - 0.18096 * Gender + 0.05842 * WatchTV + 0.20593 * HardWork - 0.25697 * SleepHours + 0.04283 * PrevGPA.
  • Based on the summary, HardWork (**) and SleepHours (*) have the smallest p-values, so for these two predictors we can reject the null hypotheses H0: Beta SleepHours = 0 and H0: Beta HardWork = 0.

(e)

Since, based on the stats from the previous model, Gender and WatchTV have relatively little correlation with GPA, we will explore combinations that keep HardWork and SleepHours and vary the remaining predictors:

Model A predictors: WatchTV, HardWork, SleepHours

# Multiple Linear RegressionModel A
another_fit_a <- lm(courseSuccess$GPA ~ WatchTV + HardWork + SleepHours, data=courseSuccess)
summary(another_fit_a) # show results
## 
## Call:
## lm(formula = courseSuccess$GPA ~ WatchTV + HardWork + SleepHours, 
##     data = courseSuccess)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.56546 -0.14301 -0.01840  0.08356  1.32242 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.10857    0.66567   6.172 1.58e-06 ***
## WatchTV      0.03279    0.12981   0.253  0.80257    
## HardWork     0.22509    0.06516   3.454  0.00191 ** 
## SleepHours  -0.22877    0.09939  -2.302  0.02961 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3444 on 26 degrees of freedom
## Multiple R-squared:   0.37,  Adjusted R-squared:  0.2973 
## F-statistic: 5.091 on 3 and 26 DF,  p-value: 0.006627

Model B predictors: HardWork, SleepHours, PrevGPA

# Multiple Linear RegressionModel B
another_fit_b <- lm(courseSuccess$GPA ~ HardWork + SleepHours + PrevGPA, data=courseSuccess)
summary(another_fit_b) # show results
## 
## Call:
## lm(formula = courseSuccess$GPA ~ HardWork + SleepHours + PrevGPA, 
##     data = courseSuccess)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.58367 -0.13896 -0.00374  0.08514  1.32881 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.05427    0.69019   5.874 3.41e-06 ***
## HardWork     0.22684    0.06486   3.497  0.00171 ** 
## SleepHours  -0.22443    0.09903  -2.266  0.03199 *  
## PrevGPA      0.01392    0.04892   0.285  0.77823    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3443 on 26 degrees of freedom
## Multiple R-squared:  0.3704, Adjusted R-squared:  0.2978 
## F-statistic:   5.1 on 3 and 26 DF,  p-value: 0.006573

Model C predictors: HardWork, SleepHours

# Multiple Linear Regression Model C
another_fit_c <- lm(courseSuccess$GPA ~ HardWork + SleepHours, data=courseSuccess)
summary(another_fit_c) # show results
## 
## Call:
## lm(formula = courseSuccess$GPA ~ HardWork + SleepHours, data = courseSuccess)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.58379 -0.13169 -0.01945  0.09469  1.33483 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.10644    0.65398   6.279 1.02e-06 ***
## HardWork     0.22662    0.06374   3.555  0.00142 ** 
## SleepHours  -0.22621    0.09714  -2.329  0.02761 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3384 on 27 degrees of freedom
## Multiple R-squared:  0.3685, Adjusted R-squared:  0.3217 
## F-statistic: 7.877 on 2 and 27 DF,  p-value: 0.002019

From the three models, we pick Model C, since it has the highest Adjusted R-squared and the lowest residual standard error.
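The adjusted R-squared values can be reproduced from the multiple R-squared via Adj R² = 1 - (1 - R²)(n - 1)/(n - p - 1), with n = 30 observations and p predictors. Checking Model C (R² = 0.3685, p = 2):

```r
## Adjusted R-squared from multiple R-squared, sample size n,
## and number of predictors p
adj.r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

adj.r2(0.3685, 30, 2) ## ~0.3217, matching the Model C summary
```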

(f)

Comparing these two models from (d) and (e):

    1. GPA = 4.30299 - 0.18096 * Gender + 0.05842 * WatchTV + 0.20593 * HardWork - 0.25697 * SleepHours + 0.04283 * PrevGPA
    2. GPA = 4.10644 + 0.22662 * HardWork - 0.22621 * SleepHours
  • The Adjusted R-squared of (d) is 0.2839 and Residual standard error is 0.3477.
  • The Adjusted R-squared of (e) is 0.3217 and Residual standard error is 0.3384.

We find that (e) would be the better model. Compared with those in (d), HardWork in (e) has a stronger correlation with GPA and SleepHours a slightly weaker one. I wonder why this pattern makes the (e) model better overall, and whether these two predictors are somehow correlated.

(g)

plot(fitted(another_fit_c), residuals(another_fit_c), xlab = "Fitted values", ylab = "Residuals")

As we plot the residuals against the fitted values, we can see that the residuals bounce randomly around the 0 line, roughly forming a “horizontal band” between -0.5 and 0.5. However, there is one point above 1 that stands out from this random pattern, which suggests an outlier.